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Abstract 

Background: The information in large collections of phylogenetic trees is useful for many comparative genomic 
studies. Therefore, there is a need for flexible tools that allow exploration of such collections in order to retrieve 
relevant data as quickly as possible. 

Results: In this paper, we present TPMS (Tree Pattern-Matching Suite), a set of programs for handling and retrieving 
gene trees according to different criteria. The programs from the suite include utilities for tree collection building, 
specific tree-pattern search strategies and tree rooting. Use of TPMS is illustrated through three examples: systematic 
search for incongruencies in a large tree collection, a short study on the Coelomata/Ecdysozoa controversy and an 
evaluation of the level of support for a recently published Mammal phylogeny. 

Conclusion: TPMS is a powerful suite allowing to quickly retrieve sets of trees matching complex patterns in large 
collection or to root trees using more rigorous approaches than the classical midpoint method. As it is made of a set 
of command-line programs, it can be easily integrated in any sequence analysis pipeline for an automated use. 



Background 

Comparative genomics is a central approach in sequence 
analysis, and many important biological results have been 
obtained through its use. Among the different programs 
and packages developed for comparative genomics, those 
using the information contained in phylogenetic trees are 
of special interest. Indeed, the explicative power brought 
by trees has still no match for problems like orthology 
detection. Horizontal Gene Transfer (HGT) prediction, 
or large-scale evolutionary studies. For orthologs detec- 
tion, the most rigorous approach to determine whether 
homologous genes are orthologous or paralogous consists 
in comparing a gene tree to the species tree considered as 
a reference [1-4]. Similarly, the same approach can be used 
for HGT detection [5]. Lastly, various kind of evolutionary 
studies have been made possible only through the use of 
large collection of trees, e.g., detection of positive selection 
in vertebrate evolution [6], analysis of DNA methylation 
levels in mammals [7], or GC-content evolution [8]. 
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The problem is that manual browsing of collections 
containing thousands of gene trees is a tedious - and 
now, almost impossible to perform - task, and auto- 
mated tools are required for large scale studies. Therefore, 
we have developed TPMS, a set of programs devoted to 
tree manipulation and query. Its central functionality is a 
pattern-matching algorithm aimed at finding gene trees 
containing specific motifs. Those patterns are written 
using an extended Newick format and they usually include 
some kind of constraint, such as node nature (duplication, 
speciation), subtree content, or level of statistical support 
for the nodes {e.g., bootstrap, jackknife or aLRT). The sec- 
ond main functionality brought by TPMS is a tree-rooting 
tool based on the use of a gene unicity score and taxo- 
nomic criteria. Its aim is to place the root of a gene tree in 
order to minimize the incongruencies observed between 
this tree and a reference species tree. 

The tree pattern search algorithm used by TPMS is 
an improved version of the one we previously published 
[1]. The main improvement is the possibility to query 
tree collections built by the users. Our previous version 
was implemented in FamFetch, a Graphical User Inter- 
face (GUI) that was only able to query a set of predefined 
databases installed on a centralized server [9]. 
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Implementation 

TPMS is a set of C++ command-line programs that 
require the Bio++ [10] and Boost (http://www.boost.org/) 
Ubraries to run and it is not dependent on the use 
of the gene families databases developed in our group 
[9]. Binaries are provided for Linux and MacOSX (Intel 
architectures only), as well as C++ source code and 
documentation. 

Tree collection building 

TPMS needs to be run on a collection built in the RAP 
format [1]. With this format, all the trees are included in 
a single text file. The header of this file consists in a ref- 
erence species tree in Newick format that contains taxa 
names on its internal nodes and leaves. Names on leaves 
correspond to species and names on internal nodes corre- 
spond to higher taxonomic groups. All taxonomic groups 
allow the use of synonyms. For instance, in the exam- 
ple file distributed with TPMS, the leaf corresponding to 
Escherichia coli strain K12 substrain W3110 is written 
as: 

"ESCHERICHIA COLI W3110 "/ "ESCHERICHIA COLI 
STR. K12 SUBSTR. W3110"/ 

"ESCHERICHIA COLI STR . W3 110 "/" ESCHERICHIA 
COLI STRAIN W3110" 

The different synonyms are separated by a slash and it is 
possible to use any of them when building a query. 

Individual gene trees are listed after the reference 
species tree. Each entry is written in Newick format and 
has associated information consisting in the tree name 
and a list of associations between the sequences names 
used at the leaves and their corresponding species names: 

HBG469283 .phb 
[ 

EC07I_1_PE647 "ESCHERICHIA COLI IAI39" 
ESCOL4_2_PE678 "ESCHERICHIA COLI STR. K-12 
SUBSTR. MG1655" 

ESCOL5_l_PE669"ESCHERICHIA COLI STR. K-12 
SUBSTR. W3110 

SHIBS_2_PE510"SHIGELLA BOYDII SB227" 
] 

(EC07I_1_PE647 : 0 . 00591, ( (ESCOL4_2_PE678 : 0 . 0, 
ESCOL5_l_PE669 : 0 . 0) -1 : 0 . 0, 
SHIBS_2_PE510:0.0) - 1 : 0 . 00591 )- 1 ; 

The species tree can contain unresolved nodes (multi- 
furcations), but not the individual gene trees since the tree 
pattern matching algorithm used only supports binary 
trees [1]. 

The program tpmsjnkdb is provided in order to pro- 
duce a collection in the suitable format. To run this 
program, the minimal requirements are a species tree 



file and a collection of gene trees in Newick format. 
Using the information provided in these two files, the 
user must then build another file containing the associ- 
ation between all the sequences names used in the gene 
trees and the species to which they belong. Note that 
tpmsjnkdb provides a functionality for facilitating this 
task. This functionality returns a ready-to-complete list of 
all the sequences names found in the gene trees files. Then 
the user has to fill the blanks with corresponding species 
names. 

Tree pattern-matching 

The tree pattern-matching algorithm itself is imple- 
mented in the tpms -query program. This is a C++ version 
of the one used in FamFetch, and a complete description 
of the algorithm can be found in [1]. Like in the original 
program, both the target tree and the tree pattern need to 
be rooted. 

Tree patterns have to be written in an extended Newick 
format, and they can include different criteria. Let v be 
a node from the pattern P, and let co and (p be the two 
sons of u. In our format, v is described by three strings 
separated by slashes. The first two strings contains the 
constraints put on u, whereas the third one - which is 
only suitable for leaves - describes the set of taxa allowed 
at this level. The simplest pattern only contains the list of 
taxa to be found on the leaves with their respective posi- 
tions. For instance, the pattern ( (//Homo sapiens, 
//Pan troglodytes) , / /Rodent ia) will find all 
the gene trees in which a subtree with sequences from 
Homo sapiens and Pan troglodytes species are grouped, 
while sequences from Rodents are located outside of this 
group. It is possible to accurately ask for the inclusion or 
the exclusion of specific taxa through the use of + and 
— signs. In this case, the pattern ( ( / /Homo sapiens , 
//Pan troglodytes), //Mammalia— Primates) 
will find all the gene trees in which a subtree with 
sequences from H, sapiens and P troglodytes species are 
grouped, while any sequences from Mammals that are 
not from Primates are located outside this group. This 
pattern is less constrained than the first one, and the 
trees it selects will include all those selected by the first 
one. 

It is also possible to introduce taxonomic constraints on 
one of the two sons of the node matching to v in the gene 
tree. This constraint is written on (jo, but is put on its father 
node since the node matching on co is not necessarily the 
direct son of the node matching on u. This kind of con- 
straint is put after the first slash, and examples of accepted 
and rejected topologies for a simple pattern are given in 
Figure 1. 

Constraints other than purely taxonomic ones can be 
introduced on nodes. First, it is possible to search for pat- 
terns including a threshold for internal branches statistical 
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B. taurus 



P. troglodytes 



M. musculus 



R. rattus 




H. sapiens 



P. troglodytes 



M. musculus 



R. rattus 



(/Primates/Pan troglodytes, //Mus musculus) ; 

Figure 1 Example of a subtree constraint. In the pattern shown in the lower part of the figure, a constraint has been put on the node leading to 
P. troglodytes. The pattern is shown in red, and the subtree constraint is shown in green. This pattern with this constraint will find all the trees in 
which a subtree with sequences from P. troglodytes and M. musculus species are grouped. Non-Primates are forbidden from V] , the common 
ancestor of P. troglodytes and M. musculus, to the node matching to P. troglodytes. Tree (A) is rejected, since the pattern does not allow B. tourus 
(which is not a Primate) to be in the subtree containing P. troglodytes (generated by V] ). On the other hand, tree (B) is accepted, as the subtree 
generated by V] contains only primates. 



support. For instance, the pattern (//Homo sapiens , 
//Mus musculus) $90 will select all the gene trees 
in which a subtree with sequences from H. sapiens and 
Mus Musculus are grouped, this with an associated boot- 
strap > 90%. Also, constraints on speciation or dupli- 
cation can be set on nodes if the program is running 
on a reconciled trees collection. This kind of nodes can 
be specified by the use of letters S or D. The pattern 
(//Homo sapiens, //Mus musculus) D will find 
all the gene trees in which a subtree with sequences 
from K sapiens and M, musculus are grouped, while 
the node that groups them is a duplication node. Lastly, 
it is possible to specify that a connection between two 
nodes in the gene tree is direct. This is specified through 
the use of the exclamation mark. Therefore, the pat- 
tern (!//Homo sapiens, //Mus musculus) will 
retrieve only gene trees in which the common ancestor of 
H, sapiens and Mus musculus has the leaf H, sapiens as a 
direct son. 

Tree rooting 

Tree rooting is an essential step in phylogeny as it orients 
the tree and enables the evolutionary history it summa- 
rizes to be interpreted. We have therefore implemented 
two rooting strategies in the tpms .computation program: 
the first one aims at maximizing the size of subtrees with 
unicopy sequences while the second one tries to maximize 
the accuracy of taxonomic affectations for nodes. The first 
approach introduces a score on the internal nodes of a 
rooted tree. Given a node v/ (/ = 1, 2, . . . , n — 1), its score 
Ui is obtained by the product of the number of sequences 
by species encountered in the subtree generated by the 
node considered. A score equal to 1 means that the cor- 
responding subtree only contains one sequence for each 



species. Now, let T be a rooted tree, its unicity score U^'^^ 
is equal to: 

i/OT = 2ln(«P) (1) 

i=l 

In a unrooted tree with n leaves, there are 2n — 3 pos- 
sible positions for the root. TPMS will try every position, 
and for each one, computes the unicity score of the cor- 
responding rooted tree. Then, the rooting that is kept is 
the one that minimizes U^^\ This approach favors dupli- 
cations closer to the root than to the leaves. It minimizes 
the number of duplication events, and is therefore based 
on a parsimony reasoning. This method can be useful to 
extract unicopy subtrees from gene families containing 
many copies for some species. It can automatically split 
trees containing paralogs into several subtrees. Figure 2 
shows two rooting examples of a simple gene tree contain- 
ing six sequences from three different species. 

The second rooting method uses the reference species 
tree included in the header of the tree collection. During 
the rooting procedure, each internal node v/ of the gene 
tree is associated with the taxonomic group correspond- 
ing to the Lowest Common Ancestor (LCA) in the species 
tree. Then, a distance di is computed for each node. This 
distance is equal to the number of nodes between the root 
of the species tree, and the node of the taxonomic group 
in the species tree associated with the node in the gene 
tree. Note that the taxa from the species tree that are 
not present in the gene tree are not taken into account 
in this distance computation. Therefore, the sum of node 
distances for a rooted tree T is equal to: 

i=l 
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Figure 2 Example of unicity score computing for a gene tree witli two different rootings. In this tree, we have six sequences O] , 02, b] , bj, 63 

and c. Sequences O] and 02 belong to species A; b] , 62 and 63 belong to species B; and c belongs to species C. Scores of internal nodes are given in 
the circles. In this example, the rooting for topology (A) gives the unicity score U'^^^ = ln(6) + ln(6) + ln(4) + ln(2) + ln(l ) ^ 5.663, and the rooting 
for topology (B) gives the score U^^'' = ln(6) + ln(4) + ln(l ) + ln(l ) + ln(l ) 3.1 78. The rooting in (B) is therefore preferred over the one in (A). 



Again, TPMS will try all the 2n — 3 rooting positions 
and will keep the one maximizing D^^^ Indeed, higher 
D^^^ values correspond to better taxonomic affectations 
(Le.y resulting trees that are more congruent to the ref- 
erence species tree). Figure 3 shows an example of two 
possible rootings for a bacterial gene tree, this considering 
a reference species tree. 

Usually, the two approaches are used in sequence, the 
first one is efficient to isolate clusters of unique sequences, 
but it can lead to many ex aequo. The problem is that, 
among those ex aequo, some of them can contain obvi- 
ously erroneous polyphyletic groups. On the other hand, 
the second approach will find the solutions that are more 
consistent to the species tree. It can be therefore used to 
solve those ex aequo in the right way, as it will remove 



the solutions that are in violation of the reference. If more 
candidates still remains after those two steps, the root is 
placed on the solution having the longest branch. In case 
of branch lengths equality, placement is arbitrary on one 
of the remaining possibilities. 

Data sets 

For the detection of tree incongruences in complete 
genomes, we performed the search on the 128674 gene 
trees available from HOGENOM release 5 (http://pbil. 
univ-lyonl.fr/databases/hogenom/). Those trees include 
a total of respectively 2789275 bacterial, 138474 archaeal 
and 738819 eukaryotic sequences. For the Ecdyso- 
zoa/Coelomata controversy and the mammals phy- 
logeny study, we used the 14190 gene trees available 



Bacteria 



(A) 



S. aureus 



Bacteria 
di=0 







E. coli 


Bacteria 
d2=0 










S. aureus 



P. aeruginosa 



(B) 



y-Proteo. 
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S. dysenteriae 
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S. dysenteriae 
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Figure 3 Example of distance score computing for a gene tree with two different rootings. The reference species tree is shown in (A), with 
the names of the taxonomic groups written on the nodes (Bacteria, y-Proteobacteria and Enterobacteria). The rooting given in (B) gives a sum of 
distances D^^^ = 1 + 0 + 0 = 1 , while the rooting given in (C) gives a sum of distances D^^^ = 1+1+0 = 2. The rooting in (C) is therefore 
preferred over the one in (B). 
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from HOMOLENS release 5 (http://pbiLuniv-lyonlir/ 
databases/homolens.php). For the first two studies, the 
reference species tree used was the one provided by 
NCBI (ftp://ftp.ncbi.nih.gov/pub/taxonomy/) and for the 
mammal phylogeny we used a slightly modified version 
of the tree published by Dos Reis et al, [11]. Those 
modifications were made to match the species tree avail- 
able in HOMOLENS as three species from [11] were not 
available in this database: Pongo abelii, Vicugna pacos 
and Canis familiaris. They were replaced respectively by 
Pongo pygmaeusy Lama pacos and a clade grouping Canis 
lupus familiaris and Ailuropoda melanoleuca. 

Results 

Incongruencies detection 

The program tpms -computation can be used to detect tax- 
onomic incongruencies. As seen above, each node in a 
gene tree can be assigned to its LCA, this using the infor- 
mation provided by the reference species tree. To detect 
the incongruencies, the approach used is to see if a leaf 
or a node in a gene tree induces a perturbation in this 
taxonomy affectation. Exploration of the tree starts from 
the leaves and goes up to the root. For each node, the 
algorithm masks the subtree it generates. If this masking 
leads to a taxonomic assignation change on its grandfather 
node, then this subtree is pruned. A perturbation index is 
associated to this incongruency, it correspond to the dis- 
tance (in nodes numbers and with ignoring the taxa that 
are not present in the gene tree) between the former tax- 
onomic affectation and the one realized after the masking 
process. This operation is repeated until all the existing 
incongruencies have been detected. Lastly, to be consid- 
ered as a true incongruency, the involved branch must 



have a support value greater than a threshold given by the 
user. 

A simple example of detection is given in Figure 4. 
In this example. Staphylococcus aureus, a Firmicute, is 
wrongly grouped with Escherichia coli, an Enterobacteria. 
Masking the node leading to S. aureus leads to a change 
from Bacteria to Enterobacteria in the taxonomic assig- 
nation of its grandfather node. As the number of nodes 
separating Bacteria from Enterobacteria in the species 
tree used, the perturbation index associated with this 
incongruency is equal to 2. 

Over the 128674 gene trees available in HOGENOM, 
tpms -Computation found 110359 incongruencies sup- 
ported by bootstrap values > 90%, this corresponds to 
an average of 0.86 errors per tree. Different events could 
explain the incongruencies observed: phylogenetic recon- 
struction errors (such as long branch attraction artefacts), 
errors in the NCBI species tree, wrong gene tree root- 
ing by TPMS, hidden paralogies (as the gene trees from 
HOGENOM are not reconciled), incomplete lineage sort- 
ing and HGTs. Due to that fact, it is remarkable that their 
frequency is relatively low, with less than 1 error per tree 
on average. 

The number of intra and inter-domain incongruencies 
detected is shown in Figure 5. An intra-domain incon- 
gruency is when a species is placed inside its domain but 
outside the taxonomic group it belongs to and an inter- 
domain incongruency is when a species is wrongly located 
in another domain than the one it belongs to. A good way 
to see if our method can be used to detect HGTs is to look 
at its ability to identify well-known transfers such as the 
ones observed between mitochondria or chloroplast and 
the nucleus of eukaryotes. For that purpose, we looked at 







E. coli 


Bacteria. 
d=0 








Figure 4 Example of incongruency detection in a subtree. The reference species tree is sliown in (A), witli tine names of tine taxonomic groups 
written on tine nodes. For eacli internal node, tine distance d to tine root is written in red. Tine subtree containing the incongruency is shown in (B1), 
with S. aureus being the misplaced taxon. In this subtree, the taxonomic group associated to the grandfather of 5. aureus is Bacteria and its distance 
to the root in the species tree is d = 0. In (B2) the father node of S. aureus is removed. The root of this subtree has now two sons E. coli and S. 
dysenteriae. The new taxonomic affectation of this node becomes Enterobacteria, which corresponds to a distance to the root d = 2, in the species 
tree. Therefore, perturbation induced by the misplacement of S. aureus is equal to 2 - 0 = 2. 
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19401 



Figure 5 Number of incongruencies inferred between and inside 
the three domains of life. The three domains (Archaea, Bacteria and 
Eul<aryota) are represented respectively by the letters A, B and E. For 
inter-domain incongruencies, the arrows give the direction of the 
switch observed. 



the incongruences found with the 374 eukaryotic genes 
labelled as mitochondrial in HOGENOM 5. Among those 
genes, 252 (67.4%) were found to be involved in an incon- 
gruency placing them among Bacteria. Similarly, among 
the 377 genes labelled as chloroplastic, 249 (66%) were 
found to be involved in an incongruency placing them 
among Bacteria. 

Coelomata/Ecdysozoa controversy 

The issue of whether coelomates form a single clade, 
the Coelomata, or whether all animals that moult an 
exoskeleton (such as the coelomate arthropods and the 
pseudocoelomate nematodes) form a distinct clade, the 
Ecdysozoa, is still a major open-ended subject in evo- 
lutionary biology. While single-gene based phylogenies 
supported the Ecdysozoa hypothesis [12-14], a first wave 
of phylogenomic analyses mostly favored the Coelomata 
[15-17]. More recently, those results were challenged by 
other genomic-scale studies [18-20], shifting again the bal- 
ance toward Ecdysozoa. Here we present a short example 
showing how it is possible to retrieve the trees supporting 
either hypothesis. 

Using TPMS, the patterns to find are (//Fungi , 
( /Nematoda/Nematoda , ( / Chordata/ Chordata , 
/Ar t hropoda/Ar t hropoda ) $ 8 0 ) $ 8 0 ) and ( / / 
Fungi, ( /Chordata/Chordata , (/Nematoda/ 



Nematoda,/Arthropoda/Arthropoda) $8 0) $8 0), 
respectively for Coelomata and Ecdysozoa hypotheses 
with a bootstrap support > 80%. We performed the 
search into a collection built with the HOMOLENS 
database, this with several thresholds for bootstrap 
values (Table 1). Excepted for the search performed 
with no minimal value for bootstrap scores, the results 
returned were systematically in favor of the Coelomata 
hypothesis. 

Note that this small example is only aimed at showing 
the possibilities provided by TPMS in terms of tree pat- 
tern search. To fully validate the result presented above 
it would be necessary to go back to the sequence align- 
ments and test for the relative evolutionary rates of indi- 
vidual genes, this in order to avoid systematic biases 
leading to reconstruction artefacts such as long branch 
attraction. Indeed, the only nematode genome available 
in HOGENOM 5 is Caenorhabditis elegans, and this 
organism is known to have genes with high evolution- 
ary rates, leading frequently to such kind of artefacts 
[21]. 

Mammals phylogeny 

In order to find sets of orthologs in a specific taxon it is 
possible to query a collection using a pattern correspond- 
ing to the subtree of a species tree containing the given 
taxon. If a gene tree contains exactly this pattern, then 
the most parsimonious hypothesis is that the matching 
sequences are orthologous. The search for exact patterns 
in TPMS requires a direct link between nodes, which is 
specified by the exclamation mark. For instance, to query 
a collection to find orthologous genes inside the Roden- 
tia order, we must select all the gene trees containing 
the TPMS pattern ( ( ( ( ! //Rattus Norvegicus , 
! //Mus Musculus) ! , !//Dipodomys ordii) ! , 
!// Cavia Porcellus) !, !// Spermophilus 
Tridecemlineatus) . This feature can be used to test 
the support level for nodes in a species tree. As an exam- 
ple, we used the mammals phylogeny published by Dos 
Reis et al. [11]. For that purpose, we extracted all possible 
subtrees in this tree and used the corresponding patterns 
to perform searches in HOMOLENS. Complete listing of 
the patterns in TPMS format is provided as Additional 
file 1. 



Table 1 Number of tree patterns matching Coelomata or 
Ecdysozoa hypotheses 





no thr. 


>80% >85% 


>90% 


>95% 


Coelomata 


768 


241 174 


110 


54 


Ecdysozoa 


1309 


149 83 


45 


20 



The queries corresponding to the two hypotheses have returned the numbers 
of patterns shown in the table. Five different thresholds have been used for 
bootstrap, from no threshold to values higher than 95%. 
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Figure 6 shows the tree topology of the Dos Reis et al 
phylogeny in which the node level of support is given 
as the ratio of the number of families matching the pat- 
tern over the total number of eligible families (/.e., the 
HOMOLENS families containing the species found in the 
subtree). As expected, both numbers decrease as we move 
from the leaves to the root of the tree due to the fact that 
deep nodes correspond to subtrees containing a growing 
number of taxa. A consequence is that the level of sup- 
port also decreases when moving to the deepest parts of 
the tree. This goes to an extreme with the nodes corre- 
sponding to Boreoeutheria, Eutheria and Theria as they 
have a support equal to zero in terms of families matching 
the pattern. This is due to the fact that it is not possible to 
specify fuzzy patterns in which some species are optional 
for the search. This would allow to put constraint on the 
placement of a species in a tree, without rejecting patterns 
which do not include this species. As we are restricted 



to strict patterns, finding families containing a large 
number of species and exactly matching this pattern is 
difficult. 

This method can also be used to test alternative 
hypotheses in a tree. For instances, the Pegasoferae 
node has the lowest support (168/6466) when look- 
ing at all the other nodes located at the same level 
in the tree. Moreover, Waddell et al. [22] made the 
hypothesis that Chiroptera is in fact a sister group of 
Equus caballus. To test this hypothesis we searched 
for the number of gene families supporting it. For 
that purpose, we used the pattern ((!//Equus 
caballus, ( ! //Pteropus vampyrus , !//Myotis 
lucif ugus) ! ) ! , ( ( ! //Canis lupus f amiliaris , 
! //Ailuropoda melanoleuca) ! , !//Felis 
catus) !). With this pattern, we obtained a support 
level of 6/6466 for Pegasoferae, which is significantly 
lower than the one observed with the original topology. 



Euarchonto- 
glires 
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Rattus norvegicus 
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■ Dipodomys ordii 

■ Cavia porcellus 

■ Spermophilus tridecemlineatus 

- Ochotona princeps 

- Oryctolagus cuniculus 

- Tupaia belangeri 

- Otolemur garnettii 

- Microcebus murinus 

- Tarsius syrichta 

- Callithrix jacchus 

- Macaca mulatta 

- Pongoabelii 
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■ Pan troglodytes 

- Homo sapiens 

- Erinaceus europaeus 

- Sorexaraneus 

- Tursiops truncatus 

- Bos taurus 
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■ Vicugna paces 

- Pteropus vampyrus 

- Myotis lucifugus 

• Equus caballus 

- Canis familiaris 

- Felis catus 

- Dasypus novemcinctus 

- Choloepus hoffmanni 

- Procavia capensis 

- Loxodonta africana 

- Echinops telfairi 

- Monodelphis domestica 

- Macropus eugenii 
Ornithorhynchus anatinus 

Figure 6 Mammal phylogeny support through the number of matching gene trees. For each node, the number of families matching the 
pattern over the number of eligible families is given. Red arrow shows the branch displacement required to obtain an alternative topology in which 
Chiroptera is a sister group of £ cobollus and their father node is a sister group of Cornivoro. 
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Discussion and conclusion 

TPMS is a useful set of programs that can be used 
in a wide range of phylogenetic problems. It has high 
performance due to its implementation in a compiled 
language and the fact that all individual programs are mul- 
tithreaded. Multithreading allows an optimal use on the 
multi-core architectures that are now common even for 
desktop computers. For instance, the two pattern searches 
(with no threshold for bootstrap values) performed on a 
collection containing 14190 trees to identify the ones sup- 
porting the Coelomata or the Ecdyzosoa hypotheses took 
only six seconds on an Intel Xeon 2.67 GHz with four 
cores. 

For the tree pattern search, the most recent program 
available to perform this kind of task is PhyloPattern 
[23]. It is implemented in Java and Prolog and its users 
have to write their queries as Prolog predicates, which 
can be very difficult for non-experts. Also, this program 
cannot be used to root trees or to search for HGTs. 
Moreover, its performance will be far below the ones 
of TPMS, due to the fact it is written with interpreted 
languages. 

For the automated tree rooting procedure, different 
alternatives to the two strategies proposed by TPMS 
are available. Due to its simplicity, the classical mid- 
point method looks like a good choice, but this is usu- 
ally a poor solution. Indeed, midpoint rooting implies 
that the molecular clock hypothesis is approximately 
true for the sequences used and, most of the time, this 
is not the case [24]. Other common approaches are 
mostly based on algorithms identifying roots that min- 
imize the number of gene duplication and losses, but 
they require reconciled trees and therefore they need 
a reliable reference species tree [1,25,26]. In the case 
of TPMS, the rooting strategy based on the use of an 
unicity score does not require the availability of such 
reference tree. This alternative is useful in the case of 
gene trees containing prokaryotic sequences, as no trusted 
species tree exist for these organisms. Also, reconcilia- 
tion methods usually performs poorly with prokaryotic 
sequences as they do not introduce the possibility of HGTs 
events. 

The application of TPMS for efficient large-scale iden- 
tification of putative HGTs would require some improve- 
ments. Presently, the algorithm is already able to search 
for incongruent patterns in which a subtree is responsible 
for a topological shift from the reference species tree. The 
problem is that the automated procedure provided does 
not guarantee that the rooting is optimal for prokaryotic 
species when performing "generic" searches {e.g., searches 
among the whole Bacteria or Archaea). If there is a lot of 
HGTs in a tree, our algorithm can lead to a wrong root- 
ing, causing partially wrong taxonomic assignments, and 
therefore false positives. On the other hand, using TPMS 



with an explicit transfer pattern (a taxon nested in another 
when it is not supposed to be according to the species 
tree) can be very efficient. A method like Prunier, which is 
specifically devoted to the detection of HGTs is of course 
more efficient than TPMS in that area but the drawback is 
that it is limited to unicopy gene families [5]. 

In the future, we have planned to add functionalities 
mainly to increase the user-friendliness. For instance, 
a module automatically completing the species names 
when building the file containing the associations between 
sequence names and the species to which they belong 
is under development. This module works with the 
remote-acnuc library [27] and requires that the con- 
sidered species are referenced in the general sequence 
collections. 

Availability and requirements 

Project name: TPMS 

Project home page: http://pbil.univ-lyonl.fr/software/ 
tpms/ 

Operating system(s): Unix-like operating systems such as 

Linux and MacOSX (Intel) 

Progamming language: C++ 

Other requirements: Bio++ and Boost libraries 

License: GNU GPL 

Additional file 



Additional file 1 : Set of TPMS patterns corresponding to all possible 
subtrees from the tree shown in Figure 6. 
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